Effective multi-modal clustering method via skip aggregation network for parallel scRNA-seq and scATAC-seq data

Abstract In recent years, there has been a growing trend in the realm of parallel clustering analysis for single-cell RNA-seq (scRNA) and single-cell Assay of Transposase Accessible Chromatin (scATAC) data. However, prevailing methods often treat these two data modalities as equals, neglecting the fact that the scRNA mode holds significantly richer information compared to the scATAC. This disregard hinders the model benefits from the insights derived from multiple modalities, compromising the overall clustering performance. To this end, we propose an effective multi-modal clustering model scEMC for parallel scRNA and Assay of Transposase Accessible Chromatin data. Concretely, we have devised a skip aggregation network to simultaneously learn global structural information among cells and integrate data from diverse modalities. To safeguard the quality of integrated cell representation against the influence stemming from sparse scATAC data, we connect the scRNA data with the aggregated representation via skip connection. Moreover, to effectively fit the real distribution of cells, we introduced a Zero Inflated Negative Binomial-based denoising autoencoder that accommodates corrupted data containing synthetic noise, concurrently integrating a joint optimization module that employs multiple losses. Extensive experiments serve to underscore the effectiveness of our model. This work contributes significantly to the ongoing exploration of cell subpopulations and tumor microenvironments, and the code of our work will be public at https://github.com/DayuHuu/scEMC.


INTRODUCTION
The advancements in single-cell transcriptomic sequencing technology have revolutionized transcriptome analysis, enabling biologists to delve into cellular heterogeneity with remarkable resolution at the single-cell level [1][2][3].Clustering analysis plays a pivotal role in transcriptome analysis, allowing for the unsupervised identification of cell subpopulations, which is crucial for downstream analyses.
Over the years, numerous attempts have been made to develop clustering methods for single-cell data [4,5].Initially, the focal points of research revolved around fundamental clustering models such as k-means clustering and spectral clustering [6,7], along with their enhanced variants.For instance, Chen et al. proposed a weighted soft k-means clustering model tailored for single-cell data, replacing the original hard clustering with a soft one [8].While these methods achieved some success, they often struggled to extract nonlinear features from the cell interactions.With the development of deep learning, researchers began to explore the realm of deep neural networks for clustering analysis.Notably, DESC emerged as a representative work, utilizing neural networks to learn meaningful representations while effectively mitigating batch effects [9].Furthermore, scDeepCluster employed an autoencoder network to concurrently conduct noise reduction and clustering for single-cell data [10].These deep clustering approaches made significant progress but overlooked the topological information among cells.In response to this limitation, graph-based deep clustering algorithms were developed, benefiting from the interactions among cells.Chen et al. introduced scGAC [11], which employed a graph attention network to execute clustering analysis.Gan et al., recognizing the significance of both attributes and topological information, proposed a deep structural clustering model scDSC [12], capable of simultaneously addressing these aspects.Despite their promising performance, these single-modal methods encountered limitations when handling multi-modal single-cell data.
Multi-modal single-cell data refers to the data obtained by sequencing the same batch of cells using different omics technologies [13][14][15][16].Currently, the parallel analysis of single-cell RNA-seq (scRNA) and single-cell Assay of Transposase Accessible Chromatin (scATAC) is a common scenario.With the rapid development of sequencing technologies, the availability of multimodal data is increasing.By leveraging the distinct modalities of the same cell, we can gain more comprehensive insights into cellular states.In recent years, several parallel clustering methods have been developed for scRNA and scATAC data.For instance, scMVAE presents a Multimodal Variational Autoencoder (MVAE), imbued with three learning strategies for inferring the distribution of multi-modal cell data [17].This field has seen extensive use of MVAE, Gong et al. utilized datasets of various modalities as inputs, applying MVAE for joint representation estimation, and performing clustering and visualization on the derived representation [18].Simultaneously, Cao et al. introduced the SAILERX deep learning framework, which diverges from conventional approaches [19].This method promotes local structural similarity between the modalities through paired similarity assessments, thereby effectively diminishing the impact of noise signals.Furthermore, Xu et al. developed a transfer learning method to identify generalizable chromatin interactions in scATAC-seq data [20].Moreover, DCCA introduces an ingenious cycle attention model, designed specifically for the unified analysis of multi-omic cell data [21].Inspired by the principles of subspace clustering, scMCS extends it to the realm of singlecell clustering [22], enabling the effective clustering of parallel single-cell data by diligently minimizing redundancy across subspaces.
However, prevailing parallel clustering methods for scRNA and scATAC often overlook the fact that scATAC data exhibit lower information richness compared to scRNA data [23,24].They treat the data from both modalities as equal inputs, disregarding the inherent differences in data effectiveness and sparsity.Consequently, in many cases, the clustering performance of the fused data from both modalities is even inferior to using only the scRNA data.This could be attributed to the low information richness of the scATAC modality, where the fusion process is challenged by the sparse information in scATAC data, resulting in poor quality of the aggregated cell representations for clustering.Existing methods face challenges in effectively integrating parallel scRNA and scATAC data, concurrently struggling to adequately fit the real distribution of single-cell data.
In light of the aforementioned points, we develop an effective multi-modal clustering model (scEMC), which integrates parallel scRNA and scATAC data while ensuring the quality of the aggregated cell representations.The proposed skip aggregation network (SAN) network extracts structural information from multiple modalities and facilitates cross-modal information fusion, simultaneously connecting with scRNA modality data via skip connection to promise that the fused cell representations do not suffer significant performance degradation.Additionally, to accurately model the distribution of real single-cell data, we have devised a Zero-Inf lated Negative Binomial (ZINB)-based denoising autoencoder accompanied by a joint optimization module.Experimental results on five benchmark datasets demonstrate the stability and superior performance of the scEMC method, outperforming eight other baseline methods.The contributions of our work can be summarized as follows: • We propose an effective parallel clustering framework scEMC, which mitigates the impact of unbalanced information richness of scRNA and scATAC data.

Preliminary
Multi-modal single-cell data refer to data obtained from the sequencing of the same batch of cells using multiple sequencing technologies.In this study, our primary focus lies in the parallel clustering of scRNA and scATAC data.Parallel clustering involves the simultaneous preprocessing of scRNA and scATAC data, followed by feature integration.scRNA and scATAC are interrelated in the processing phase, rather than being completely independent.To facilitate the illustration, we provide a clear mathematical description for them.The scRNA data is represented as X r = {x r 1 ; ...; x r N } ∈ R N×Dr , while scATAC denoted as X a = {x a 1 ; ...; x a N } ∈ R N×Da , where N signifies the number of cells, D r andD a denote the feature dimensions of the scRNA and scATAC modal data, respectively.

The framework of scEMC
The architecture of scEMC aims to learn effective cell representations across multiple modalities and mitigate the impact of imbalanced data richness in diverse modalities, which is crucial for conducting parallel clustering.As illustrated in Figure 1, it consists of two main modules: a ZINB-based denoising autoencoder for generating cell representations and an SAN module for aggregating multi-modal information and preventing network degradation.For clearer understanding, the notations are presented in Table 1.
The implementation process can be divided into four steps: • First, the original scRNA and scATAC data, augmented with simulated noise, are fed into the denoising autoencoder.

Figure 1.
Illustration of the framework of scEMC.The whole process is divided into four stages.Following the introduction of Gaussian noise, the original numerical matrices of scRNA and scATAC data are input into an autoencoder.Afterward, the cell embeddings Z r and Z a from both modalities undergo effective fusion via an SAN, with a focus on preserving the utmost information from the informative scRNA modality.The final cell representations Z derived from the SAN network are decoded to produce three data distributions.These distributions are employed to calculate the ZINB loss, which is jointly optimized alongside the clustering loss of the representations.
Subsequently, they are embedded into a lower-dimensional space, and the resulting embeddings are concatenated to build a shared embedding Z. • Inspired by the transformer architecture, Z is mapped into three independent feature spaces.We retain one of them H 3 for learning transformations of the original features, while the other two H 1 and H 2 are used to compute global structural relationships among cells.This process results in the generation of a global structural enhanced embedding, denoted as Ẑ.It is subsequently concatenated with the original scRNA embedding through skip connections, aiming to preserve the information-rich scRNA modality data.This yields the aggregated skip embedding Z. • The embedding produced by the SAN module then undergoes an intuitive decoding process, where it is decoded into distinct modalities using two separate decoders.
• Finally, three distributions, namely Dropout, Dispersion and Mean, are computed from the decoded embeddings.These distributions are then utilized to calculate the ZINB loss for different modalities.It serves as the reconstruction loss, which, together with the clustering loss, jointly optimizes the cell representations.By leveraging end-to-end training and real-time optimization, we obtain high-quality cell representations capable of achieving unsupervised clustering with high accuracy.

ZINB-based distribution
To simulate the distribution of real cells and learn effective cell representations, we employ a denoising autoencoder based on the ZINB loss.Since real-world single-cell data often contain noise, we augment the input multi-modal data X r and X a with Gaussian noise.The process is denoted as follows: where n r and n a represent the simulated Gaussian signals added to the scRNA and scATAC data, respectively, with a mean of 0 and a variance of 1. σ r and σ a are the weight coefficients that control the inf luence of n r and n a , respectively.The perturbed parallel single-cell data Xr and Xa is fed into a multi-modal autoencoder, wherein it is embedded into a lowerdimensional feature space, as represented by the following equation: here f r e (•) and f a e (•) correspond to the encoder mappings for scRNA and scATAC data, respectively.While Z r and Z a represent the resulting low-dimensional embeddings from both modalities.These embeddings are subsequently transformed by the SAN module, resulting in the generation of the final shared embedding, denoted as Z.
Upon this basis, we have introduced the ZINB distribution to estimate the distribution of single-cell data [25][26][27].Despite the ZINB loss not being specifically designed for scATAC-seq data, its proficiency in addressing over-dispersion and data sparsity renders it an appropriate selection.Following the approach of Lin et al., this work models both scRNA and scATAC data using ZINB Decoders for the scRNA and the scATAC.
loss [24].Nevertheless, this study does not claim that ZINB distribution is the optimal distribution for scATAC data, researchers are encouraged to consider a loss function that may be more appropriate for scATAC data.Prior to constructing the ZINB distribution, the Negative Binomial (NB) distribution is initially computed, which is a type of discrete distribution.Since this work involves two modalities of data, we will elucidate the equation using an example from the scRNA modality X r : here π , μ, θ denote the dropout rate, dispersion degree, and mean, respectively.Deviating from a conventional autoencoder, the ZINB-based denoising autoencoder incorporates three separate fully connected layers that are connected to the last layer of the decoding network.This architecture aims to estimate the parameters π , μ, θ within the shared embedding Z, which is denoted as below: f r d represents a fully connected decoding neural network.W r π , W r μ and W r θ are three learnable weight matrices corresponding to three parameters in the ZINB distribution., M, are parameter matrices representing the dropout rate, mean and dispersion, respectively.It is worth noting that the dropout rate typically ranges between 0 and 1, which accounts for why we employ the sigmoid function sigmoid(•) for .Similarly, we apply the exponential function exp(•) to the other two parameters because of their non-negativity.Finally, the negative log-likelihood of the ZINB distribution is defined as the reconstruction loss for the input data X r , and its mathematical form is as follows: the computational process for scATAC is similar to that of scRNA and can be represented as follows: ultimately, for parallel scRNA and scATAC data, the overall reconstruction loss of ZINB-based denoising autoencoder is defined as follows:

Skip aggregation network
Given the disparity in data richness between the scRNA and scATAC modalities, parallel analysis necessitates an effective aggregation method to handle the multi-modal data.Therefore, we introduce the SAN module that begins with the concatenation of the embeddings from both modalities, as depicted below: Inspired by the transformer architecture [28][29][30], we have designed a similar structure to map the concatenated embeddings Z into three separate feature spaces.The mapping process is as follows: here, W t 1 , W t 2 and W t 3 represent three weight matrices used for the mapping transformation, while H 1 , H 2 and H 3 denote three obtained embeddings via the mapping process.Subsequently, H 1 and H 2 are utilized to compute a global relationship matrix H s : d represents the dimension of the embedding Z. Afterward, the preserved embedding H 3 is enhanced by global relationship matrix H s , simultaneously combined with Z through skip connections to prevent network degradation.The mathematical process is as follows: where W h denotes weight matrix for the skip transformation, b represents the corresponding bias.Then we obtained the aggregated embedding Ẑ.To preserve the rich information of scRNA and prevent degradation of the aggregated representation, we concatenate the aggregated representation Ẑ with the original scRNA embedding Z r , this forms the basis of the proposed skip module.Such an adjustment effectively transforms the aggregation module into a fine-tuning mechanism tailored for scRNA data.Consequently, this approach not only utilizes information from multiple modalities but also ensures that the final cellular representation is robust against the sparse data characteristic of a single modality.The formula is as follows:

Joint optimizing module
During the training process, reconstruction and clustering loss are employed for joint optimization.We minimize the following overall objective function: where L rec and L clu represent the clustering loss and reconstruction loss, respectively, while λ 1 and λ 2 are two hyperparameters that balance their contributions.The reconstruction loss L rec has been previously described in detail.On the other hand, the clustering loss, L clu , can be further decomposed into two components: the Kullback-Leibler (KL) divergence loss and the deep k-means loss.

KL divergence on the cell representations
During the clustering process, cells with similar features are assigned to the same cluster.In this work, we employ the KL divergence loss to further enhance the correlation among similar cells.Following the previous approach [31][32][33], we utilize the Student's t-distribution to depict the pairwise similarity between cell i and cell j, as presented below: here, Zi and Zj denote the emdedding of cell i and j, q ij represents the soft assignment, measuring the pairwise similarity between two cells, i and j.Additionally, p ij is the target distribution, constructed based on q ij .This construction is designed to enhance or diminish the affinities between cells with higher and lower similarities, respectively.The computational process is as follows: Upon acquiring two distributions, we formulate the KL divergence loss as a means to converge Q towards P. The expression for the KL loss function is as follows:

Deep k-means clustering
In the bottleneck layer, also known as the hidden layer, we performed unsupervised clustering and the clustering loss is defined as follows: where, V j represents the j-th cluster center, f (•) calculates the Euclidean distance between the cell and the cluster center.While w ij represents the weight of distance.To ensure gradient smoothness, the Gaussian kernel is employed for the transformation of feature projections, following the procedure outlined below: To facilitate convergence, an additional inf lation operation is incorporated for the weight wij : Then, we amalgamated the KL divergence loss and the deep kmeans distance loss to form the ultimate clustering loss L clu :

Datasets
In this parallel clustering analysis, we conducted comprehensive experiments on five authentic multi-modal single-cell datasets: BMNC (https://www.ncbi.nlm.nih.gov/geo),PBMC (https://www.10xgenomics.com/resources/datasets),SLN111 (https://github.com/YosefLab/totalVI_reproducibility), SMAGE-10K 2 , SMAGE-3K 2 .The data sources have been delineated in the footnotes, and corresponding cell type labels were downloaded.The cluster number was determined by the categories of the downloaded cell type labels.All of these datasets encompass both scRNA and scATAC sequencing for the same batch of cells.In cases where the datasets have already undergone dimensionality reduction by the original authors, we will employ their processed forms.For datasets that have not yet been dimensionally reduced, we achieve standardization by limiting the feature count to 2000, thus ensuring consistency.An overview of the dataset information, including the number of cells, dimensions of the scRNA data, dimensions of the scATAC data and the number of clusters, is provided in Table 2.

Evaluation metrics
This work employed two widely used evaluation metrics, adjusted Rand index (ARI) and normalized mutual information (NMI), to evaluate the clustering performance.ARI is a measure of similarity between two clusterings, which ranges between −1 and 1, with closer to 1 indicating higher consistency.It can be formulated as follows: Furthermore, NMI is a normalized mutual information metric used to measure the shared information between two clusterings, which ranges between 0 and 1, with closer to 1 indicating higher similarity.Its mathematical representation is as follows:

Implementation details
We employ PyTorch (version 1.13.1) in Python 3.7 to implement scEMC.The encoding layers of the ZINB-based denoising autoencoder are set (256, 64, 32, 8), with a bottleneck layer size of 8 for both scRNA and scATAC.The aggregated bottleneck layer size is 24, while the batch size is set to 256.Initially, we conduct the pretraining for 400 epochs, followed by 5000 epochs of training.These experiments are performed on a personal computer running the Linux operating system, which is configured with an i9-12900KF CPU, 64 GB of RAM and a GeForce RTX 3070Ti GPU.It is important to note that our algorithm utilizes k-means, so the user needs to manually specify the number of clusters before running the algorithm.For the baseline methods, we conducted the kmeans and spectral clustering algorithms utilizing the scikit-learn package.Regarding the other comparative methods, we followed the implementations as delineated in their respective official repositories.For all methods employed, parameter settings were maintained as per the default configurations.

RESULTS
scEMC attains outstanding clustering performance.
To comprehensively evaluate the clustering performance of our scEMC, in this work, we conduct thorough experimentation across five multi-modal single-cell datasets, along with the inclusion of eight competitive methods.These competitive methodologies can be categorized into three groups, multi-modal clustering methods: scMVAE, scMCs, DCCA; single-modal clustering methods: scDSC, scDeepCluster, DESC; foundational clustering methods: spectral clustering and k-means clustering.A brief introduction to these approaches is presented below: • scMVAE [17]: deep-joint-learning analysis model of singlecell transcriptome and open chromatin accessibility data.• scMCs [22]: scMCs: a framework for single-cell multi-omics data integration and multiple clusterings.
As depicted in Table 3, the clustering performance of scEMC and the eight competitive methods is quantitatively evaluated by ARI and NMI.The results indicate that scEMC surpasses other clustering algorithms significantly.Over a series of 10 evaluations, it consistently achieved the top position 9 times, yielding only the

scEMC effectively integratessparse information from the scATAC modality
The motivation for our study arises from the observation that the application of data from multiple modalities in multi-modal A deeper investigation into this pattern revealed that the quality of scRNA modality data generally exceeds that of scATAC data.As a result, the comparatively lower quality of scATAC data can negatively inf luence the overall performance of the model.To solve this issue, we developed the SAN network, which shifts our model from an equal integration strategy to a refined tuning mechanism primarily based on scRNA modality data.
To assess the SAN network's effectiveness in leveraging the lower-quality scATAC data, we executed comparative experiments on the five datasets involved in our study.These experiments were divided into two categories: one without incorporating scATAC modality information and the other including all data.The outcomes, illustrated in Figures 2 and 3, reveal a noticeable decrease in model performance when scATAC modality information is excluded.This finding highlights that the SAN network not only diminishes the adverse effects of low-quality scATAC data on the model but also well consolidates sparse information from the scATAC modality, thereby improving clustering accuracy.

scEMC learns effective cell representations in latent space
In this research, plenty of computations take place within the latent space.Consequently, the quality of cell representations directly exerts inf luence on clustering performance.
To investigate whether scEMC has learned high-quality cell representations, we retained the hidden layers of scEMC and its various variants, visualizing them through t-SNE on the PBMC dataset.These variants included four different absences: removal of the aggregation module, skip connections with scRNA, clustering loss, and reconstruction loss.For the sake of illustration, we use the abbreviation w.o. to signify the absence of these modules.
As shown in Figure 4, we observed that once the skip connection with scRNA data was removed, the quality of the learned cell representations drastically declined, resulting in chaotic clusters that make it challenging to distinguish between different cell types.Simultaneously, when the structural aggregation module is removed, cells that do not belong to the same class are grouped into one cluster.This implies that the structural aggregation module effectively enhances the quality of the learned embeddings.Furthermore, the removal of clustering loss or reconstruction loss leads to a certain degree of degradation in cell representations' quality, demonstrating the effectiveness of the loss functions we have employed in optimizing the cell representations.

Ablation study
To further investigate the individual impacts of proposed modules on the overall performance, we conducted comprehensive ablation experiments.
Specifically, we constructed two sets of variants of scEMC and compared their clustering performance.The first set of variants was created to validate the effectiveness of the network structure.In this set, we devised two variants: one that removed the structural aggregation mechanism and another that eliminated skip connections with the scRNA data.The results are presented in Table 4, and from the results, it is evident that both the removal of the structural aggregation mechanism and skip connections with the scRNA data resulted in a significant degradation in performance.This indicates that the structural aggregation mechanism effectively integrates information from both modalities, while  process is prematurely terminated.Nevertheless, the results in Figure 5 robustly confirm the effectiveness of optimization and the convergence of our algorithm.

Parameter analysis
In the previous section, we built two hyperparameters, denoted as λ 1 and λ 2 , to measure the contribution between clustering loss and reconstruction loss.
Here, we comprehensively assessed the inf luence of these hyperparameters on the clustering performance of scEMC.The experiments were conducted under various parameter sets, with both parameters ranging from (0.01, 0.1, 1, 10, 100).The three-dimensional visualizations of the results are presented in Figures 6 and 7.
From Figures 6 and 7, it becomes apparent that the NMI performance of the proposed scEMC algorithm remains not sensitive to the parameter values within this range.The performance exhibits minimal f luctuations, with only a marginal decline observed when λ 1 is set to 0.01 and λ 2 is set to 100.Conversely, Figures 6  and 7 reveals that the ARI performance of the scEMC algorithm is sensitive to λ 1 within the range of 0.01 to 1. Optimal performance is achieved at λ 1 =1 and λ 2 =10.This phenomenon might be attributed to the fact that ARI is based on the consistency in categorizing paired elements, whereas NMI relies on information sharing, thereby rendering it relatively insensitive to changes in parameters when compared to ARI.Based on experience, we configure λ 1 and λ 2 in accordance with this optimal parameter setting.

CONCLUSION
In conclusion, we have developed an effective parallel clustering method, scEMC, tailored for scRNA and scATAC data.It leverages the transformer architecture to learn cross-modal global structural information from parallel single-cell data and facilitates the fusion of cross-modal information.Additionally, by incorporating skip connections that link with scRNA modality data, scEMC prevents the network from degrading.This skip mechanism effectively preserves richer scRNA data, while the designed denoising autoencoder based on ZINB optimally fits single-cell data and refines the cell representations.Experimental results demonstrate that our model outperforms other methods in terms of clustering performance.
Furthermore, there remain certain limitations that necessitate our attention.Currently, our proposed framework primarily considers the parallel analysis of scRNA and scATAC data.More sequencing modalities can be integrated into our framework in the future.Additional fusion strategies, such as concatenation and ensemble learning [34][35][36], can be incorporated to enhance the aggregation capabilities of our framework.Additionally, the design of a more discriminative network structure might be a potentially effective direction to improve this model.

Key Points
• We propose an effective parallel clustering framework scEMC, which mitigates the impact of unbalanced information richness of scRNA and scATAC data.

Figure 2 .
Figure 2. Assess the model's performance both with and without scATAC information using the ARI.

Figure 3 .
Figure 3. Assess the model's performance both with and without scATAC information using the NMI.

Figure 4 .
Figure 4. 2D t-SNE visualization showcasing the clustering quality disparities among embeddings in the absence of various modules, on the PBMC dataset.(A) w.o.aggregation module.(B) w.o.skip module.(C) w.o.clustering loss.(D) w.o.reconstruction loss.(E) scEMC with all modules preserved.
• Different from previous methods, we have introduced a pioneering SAN module that incorporates transformer structure to learn the global structural relationships between diverse feature spaces, facilitating aggregation across different modalities.Moreover, we create a skip connection between the aggregated representation and the scRNA modality data to safeguard the network from degradation.•By leveraging a denoising autoencoder based on the ZINB loss, scEMC enables the network to fit the real distribution of single-cell data.Extensive experiments demonstrate the excellence of scEMC, surpassing the other benchmark methods.

Table 2 :
The summary of datasets

Table 3 :
Clustering result comparison for five datasets.